Allow caller to retrieve the URL used for the request#56

Open
lfoppiano wants to merge 5 commits into cc from feature/store-internal-url-in-response-metadata

@lfoppiano

This PR (one part of #54) covers NUTCH-3173 for okhttp-protocol and attempts to solve the problem in a generic way.

We add a new method, getRawUrl(), to the Response.java interface contract; it returns the URL that was initially provided by the caller. getUrl() returns the actual URL used for the request.
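
For illustration, the contract described above could look roughly like the sketch below. UrlAwareResponse, SimpleResponse and ResponseDemo are hypothetical names, and the return types (String for the raw URL, URI for the request URL) are assumptions; the actual Response.java in this PR may differ.

```java
import java.net.URI;

/** Hypothetical, reduced stand-in for the contract; not the actual Response.java. */
interface UrlAwareResponse {
  /** The URL exactly as the caller provided it. */
  String getRawUrl();

  /** The URL actually used for the request (may differ if the plugin altered it). */
  URI getUrl();
}

/** Minimal immutable implementation, for illustration only. */
final class SimpleResponse implements UrlAwareResponse {
  private final String rawUrl;
  private final URI requestUrl;

  SimpleResponse(String rawUrl, URI requestUrl) {
    this.rawUrl = rawUrl;
    this.requestUrl = requestUrl;
  }

  @Override
  public String getRawUrl() {
    return rawUrl;
  }

  @Override
  public URI getUrl() {
    return requestUrl;
  }
}

final class ResponseDemo {
  public static void main(String[] args) {
    // Example values only: a caller-supplied URL and the URL sent on the wire.
    UrlAwareResponse response = new SimpleResponse(
        "https://a.example/page",
        URI.create("https://b.example/page"));
    System.out.println("requested: " + response.getRawUrl());
    System.out.println("fetched:   " + response.getUrl());
  }
}
```

A caller that writes records (for example a WARC writer) could then choose which of the two URLs to store.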

@lfoppiano lfoppiano force-pushed the feature/store-internal-url-in-response-metadata branch from 09536d8 to d6b90e3 Compare May 20, 2026 08:34
@lfoppiano

lfoppiano commented May 20, 2026

Author

OK. If I have done it correctly, getUrl() now returns a different URL when the plugin's internal parsing (whatever it is) works correctly and alters the URL. This should be applied before we fetch. I'm not sure this is what we discussed (there was an open question about changing the URL eventually).

Below are the WARC records before and after this change. One of the hops is skipped, so this might be something to fix.

Before this fix:

2026-05-20 14:29:04,778 INFO o.c.u.WarcOutputFormat [pool-15-thread-1] Partition: 0
2026-05-20 14:29:04,948 INFO o.c.u.WarcRecordWriter [pool-15-thread-1] WARC response record https://harmony.one/human (2026-05-20T12:29:02Z, status: 301, size: 0)
2026-05-20 14:29:04,966 INFO o.c.u.WarcRecordWriter [pool-15-thread-1] WARC response record https://harmony.one/human (2026-05-20T12:29:03Z, status: 301, size: 0)
2026-05-20 14:29:04,967 INFO o.c.u.WarcRecordWriter [pool-15-thread-1] WARC response record https://www.h.country/ (2026-05-20T12:29:03Z, status: 301, size: 167)
2026-05-20 14:29:04,969 INFO o.c.u.WarcRecordWriter [pool-15-thread-1] WARC response record https://www.h.country/robots.txt (2026-05-20T12:29:02Z, status: 301, size: 167)
2026-05-20 14:29:04,970 INFO o.c.u.WarcRecordWriter [pool-15-thread-1] WARC response record https://xn--qv9h.s.country/p/human-protocol-aligning-hearts-bots (2026-05-20T12:29:03Z, status: 404, size: 415)
2026-05-20 14:29:04,970 INFO o.c.u.WarcRecordWriter [pool-15-thread-1] WARC response record https://xn--qv9h.s.country/robots.txt (2026-05-20T12:29:03Z, status: 404, size: 415)
2026-05-20 14:29:04,971 INFO o.c.u.WarcRecordWriter [pool-15-thread-1] WARC response record https://🧠.s.country/p/human-protocol-aligning-hearts-bots (2026-05-20T12:29:02Z, status: 404, size: 415)

After this fix:

2026-05-20 14:30:48,956 INFO o.c.u.WarcRecordWriter [pool-15-thread-1] WARC response record https://harmony.one/human (2026-05-20T12:30:45Z, status: 301, size: 0)
2026-05-20 14:30:48,971 INFO o.c.u.WarcRecordWriter [pool-15-thread-1] WARC response record https://harmony.one/human (2026-05-20T12:30:47Z, status: 301, size: 0)
2026-05-20 14:30:48,973 INFO o.c.u.WarcRecordWriter [pool-15-thread-1] WARC response record https://www.h.country/ (2026-05-20T12:30:47Z, status: 301, size: 167)
2026-05-20 14:30:48,974 INFO o.c.u.WarcRecordWriter [pool-15-thread-1] WARC response record https://www.h.country/robots.txt (2026-05-20T12:30:43Z, status: 301, size: 167)
2026-05-20 14:30:48,976 INFO o.c.u.WarcRecordWriter [pool-15-thread-1] WARC response record https://xn--qv9h.s.country/p/human-protocol-aligning-hearts-bots (2026-05-20T12:30:47Z, status: 404, size: 415)
2026-05-20 14:30:48,977 INFO o.c.u.WarcRecordWriter [pool-15-thread-1] WARC response record https://xn--qv9h.s.country/p/human-protocol-aligning-hearts-bots (2026-05-20T12:30:47Z, status: 404, size: 415)
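
The records above show a host in Unicode form before the change and in `xn--` (Punycode) form after it. As a rough illustration of that kind of host conversion, here is a minimal sketch using the JDK's java.net.IDN. This is an assumption for illustration only: the thread does not show how the okhttp-protocol plugin converts hosts, and HostAscii and toAsciiHost are hypothetical names. The sketch has not been checked against the emoji host from the logs.

```java
import java.net.IDN;

final class HostAscii {
  /** Converts a host name to its ASCII (Punycode) form using the JDK's IDN class. */
  static String toAsciiHost(String host) {
    // ALLOW_UNASSIGNED lets code points that the JDK's IDNA tables treat as
    // unassigned pass through instead of being rejected.
    return IDN.toASCII(host, IDN.ALLOW_UNASSIGNED);
  }

  public static void main(String[] args) {
    // "b\u00FCcher" is "bücher"; a plain ASCII host is returned unchanged.
    System.out.println(toAsciiHost("b\u00FCcher.example"));
    System.out.println(toAsciiHost("harmony.one"));
  }
}
```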

@sebastian-nagel sebastian-nagel left a comment

Thanks, @lfoppiano! Looks generally good.

Two major points to consider:

  • eventually, drop the raw URL, which would reduce the number of modified classes significantly
  • might move some methods to the interface

See inline comments for details.

Comment thread src/java/org/apache/nutch/net/protocols/Response.java Outdated
Comment thread src/java/org/apache/nutch/net/protocols/Response.java Outdated
Comment thread src/java/org/apache/nutch/net/protocols/Response.java Outdated
Comment thread src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java Outdated
@lfoppiano lfoppiano force-pushed the feature/store-internal-url-in-response-metadata branch from 70872b9 to ff1fd25 Compare May 22, 2026 11:14
@lfoppiano lfoppiano marked this pull request as ready for review May 22, 2026 11:17
@lfoppiano lfoppiano requested a review from sebastian-nagel May 22, 2026 11:24
@lfoppiano

Author

Thanks! I wasn't expecting this PR to be so close to completion. I've made all the modifications. I still haven't figured out why the robots.txt is not visited in comment #56 (comment), though.

@lfoppiano

Author

Following up on my previous comment: after setting the log level to DEBUG, I got the following.

main/master

2026-05-25 16:41:06,400 INFO o.a.n.f.FetcherThread [FetcherThread] FetcherThread 53 fetching https://www.h.country/ (queue crawl delay=3600ms)
2026-05-25 16:41:06,400 DEBUG o.a.n.f.FetcherThread [FetcherThread] redirectCount=0
2026-05-25 16:41:06,401 INFO o.a.h.c.Configuration [LocalJobRunner Map Task Executor #0] found resource host-protocol-mapping.txt at file:/Users/lfoppiano/development/projects/cc/nutch/runtime/local/conf/host-protocol-mapping.txt
2026-05-25 16:41:06,401 INFO o.a.n.f.FetcherThread [LocalJobRunner Map Task Executor #0] FetcherThread 49 Using queue mode : byHost
2026-05-25 16:41:06,402 INFO o.a.n.f.FetcherThread [FetcherThread] FetcherThread 54 has no more work available
2026-05-25 16:41:06,403 INFO o.a.n.f.FetcherThread [FetcherThread] FetcherThread 54 -finishing thread FetcherThread, activeThreads=1
2026-05-25 16:41:06,403 INFO o.a.h.c.Configuration [LocalJobRunner Map Task Executor #0] found resource host-protocol-mapping.txt at file:/Users/lfoppiano/development/projects/cc/nutch/runtime/local/conf/host-protocol-mapping.txt
2026-05-25 16:41:06,403 INFO o.a.n.f.FetcherThread [LocalJobRunner Map Task Executor #0] FetcherThread 49 Using queue mode : byHost
2026-05-25 16:41:06,403 INFO o.a.n.f.FetcherThread [FetcherThread] FetcherThread 55 has no more work available
2026-05-25 16:41:06,404 INFO o.a.n.f.FetcherThread [FetcherThread] FetcherThread 55 -finishing thread FetcherThread, activeThreads=1
2026-05-25 16:41:06,405 INFO o.a.h.c.Configuration [LocalJobRunner Map Task Executor #0] found resource host-protocol-mapping.txt at file:/Users/lfoppiano/development/projects/cc/nutch/runtime/local/conf/host-protocol-mapping.txt
2026-05-25 16:41:06,405 INFO o.a.n.f.FetcherThread [LocalJobRunner Map Task Executor #0] FetcherThread 49 Using queue mode : byHost
2026-05-25 16:41:06,405 INFO o.a.n.f.FetcherThread [FetcherThread] FetcherThread 56 has no more work available
2026-05-25 16:41:06,405 INFO o.a.n.f.FetcherThread [FetcherThread] FetcherThread 56 -finishing thread FetcherThread, activeThreads=1
2026-05-25 16:41:06,406 INFO o.a.h.c.Configuration [LocalJobRunner Map Task Executor #0] found resource host-protocol-mapping.txt at file:/Users/lfoppiano/development/projects/cc/nutch/runtime/local/conf/host-protocol-mapping.txt
2026-05-25 16:41:06,407 INFO o.a.n.f.FetcherThread [LocalJobRunner Map Task Executor #0] FetcherThread 49 Using queue mode : byHost
2026-05-25 16:41:06,407 INFO o.a.n.f.FetcherThread [FetcherThread] FetcherThread 57 has no more work available
2026-05-25 16:41:06,407 INFO o.a.n.f.FetcherThread [FetcherThread] FetcherThread 57 -finishing thread FetcherThread, activeThreads=1
2026-05-25 16:41:06,408 INFO o.a.h.c.Configuration [LocalJobRunner Map Task Executor #0] found resource host-protocol-mapping.txt at file:/Users/lfoppiano/development/projects/cc/nutch/runtime/local/conf/host-protocol-mapping.txt
2026-05-25 16:41:06,409 INFO o.a.n.f.FetcherThread [LocalJobRunner Map Task Executor #0] FetcherThread 49 Using queue mode : byHost
2026-05-25 16:41:06,409 INFO o.a.n.f.FetcherThread [FetcherThread] FetcherThread 58 has no more work available
2026-05-25 16:41:06,409 INFO o.a.n.f.FetcherThread [FetcherThread] FetcherThread 58 -finishing thread FetcherThread, activeThreads=1
2026-05-25 16:41:06,410 INFO o.a.h.c.Configuration [LocalJobRunner Map Task Executor #0] found resource host-protocol-mapping.txt at file:/Users/lfoppiano/development/projects/cc/nutch/runtime/local/conf/host-protocol-mapping.txt
2026-05-25 16:41:06,410 INFO o.a.n.f.FetcherThread [LocalJobRunner Map Task Executor #0] FetcherThread 49 Using queue mode : byHost
2026-05-25 16:41:06,411 INFO o.a.n.f.FetcherThread [FetcherThread] FetcherThread 59 has no more work available
2026-05-25 16:41:06,411 INFO o.a.n.f.FetcherThread [FetcherThread] FetcherThread 59 -finishing thread FetcherThread, activeThreads=1
2026-05-25 16:41:06,412 INFO o.a.h.c.Configuration [LocalJobRunner Map Task Executor #0] found resource host-protocol-mapping.txt at file:/Users/lfoppiano/development/projects/cc/nutch/runtime/local/conf/host-protocol-mapping.txt
2026-05-25 16:41:06,412 INFO o.a.n.f.FetcherThread [LocalJobRunner Map Task Executor #0] FetcherThread 49 Using queue mode : byHost
2026-05-25 16:41:06,412 INFO o.a.n.f.FetcherThread [FetcherThread] FetcherThread 60 has no more work available
2026-05-25 16:41:06,412 INFO o.a.n.f.FetcherThread [FetcherThread] FetcherThread 60 -finishing thread FetcherThread, activeThreads=1
2026-05-25 16:41:06,414 INFO o.a.h.c.Configuration [LocalJobRunner Map Task Executor #0] found resource host-protocol-mapping.txt at file:/Users/lfoppiano/development/projects/cc/nutch/runtime/local/conf/host-protocol-mapping.txt
2026-05-25 16:41:06,414 INFO o.a.n.f.FetcherThread [LocalJobRunner Map Task Executor #0] FetcherThread 49 Using queue mode : byHost
2026-05-25 16:41:06,414 INFO o.a.n.f.FetcherThread [FetcherThread] FetcherThread 61 has no more work available
2026-05-25 16:41:06,414 INFO o.a.n.f.FetcherThread [FetcherThread] FetcherThread 61 -finishing thread FetcherThread, activeThreads=1
2026-05-25 16:41:06,416 INFO o.a.h.c.Configuration [LocalJobRunner Map Task Executor #0] found resource host-protocol-mapping.txt at file:/Users/lfoppiano/development/projects/cc/nutch/runtime/local/conf/host-protocol-mapping.txt
2026-05-25 16:41:06,416 INFO o.a.n.f.FetcherThread [LocalJobRunner Map Task Executor #0] FetcherThread 49 Using queue mode : byHost
2026-05-25 16:41:06,416 INFO o.a.n.f.Fetcher [LocalJobRunner Map Task Executor #0] Fetcher: throughput threshold: 4
2026-05-25 16:41:06,416 INFO o.a.n.f.Fetcher [LocalJobRunner Map Task Executor #0] Fetcher: throughput threshold retries: 360
2026-05-25 16:41:06,416 INFO o.a.n.f.FetcherThread [FetcherThread] FetcherThread 62 has no more work available
2026-05-25 16:41:06,416 INFO o.a.n.f.FetcherThread [FetcherThread] FetcherThread 62 -finishing thread FetcherThread, activeThreads=1
2026-05-25 16:41:06,419 INFO c.d.EffectiveTldFinder [FetcherThread] Loading public suffix list from class path: file:/Users/lfoppiano/development/projects/cc/nutch/runtime/local/conf/effective_tld_names.dat
2026-05-25 16:41:06,427 INFO c.d.EffectiveTldFinder [FetcherThread] Public suffix list VERSION: 2026-02-22_12-29-42_UTC
2026-05-25 16:41:06,427 INFO c.d.EffectiveTldFinder [FetcherThread] Public suffix list COMMIT: 35aa65b9d2ea34cdfa7ece1432603181cf2258de
2026-05-25 16:41:06,481 INFO c.d.EffectiveTldFinder [FetcherThread] Successfully read public suffix list: 330807 bytes, 16285 lines, 10142 rules
2026-05-25 16:41:06,483 INFO c.d.EffectiveTldFinder [FetcherThread] Digest of public suffix list: MD5 = 6A58B22E28487D9A988DDBE0922A9365
2026-05-25 16:41:06,483 INFO c.d.EffectiveTldFinder [FetcherThread] Digest of public suffix list: SHA-512 = 854CA933175CBC2E299BBF1EF6FB16FCC4DF76A7628F39C639A93F7D3EBE49F0F35F218D86FA2FB7CF963321C67AA7FC2232D92521CB4E009F359921397FF495
2026-05-25 16:41:06,519 DEBUG o.a.t.c.TikaConfig [FetcherThread] loading tika config from defaults; no config file specified
2026-05-25 16:41:07,090 DEBUG o.a.t.p.e.ExternalParser [FetcherThread] exit value for tesseract: 1
2026-05-25 16:41:07,090 DEBUG o.a.t.p.o.TesseractOCRParser [FetcherThread] hasTesseract (path: [tesseract]): true
2026-05-25 16:41:07,264 DEBUG o.a.h.s.UserGroupInformation [main] PrivilegedAction [as: lfoppiano (auth:SIMPLE)][action: org.apache.hadoop.mapreduce.Job$1@6b2e0f78]
java.lang.Exception
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1896)
	at org.apache.hadoop.mapreduce.Job.updateStatus(Job.java:330)
	at org.apache.hadoop.mapreduce.Job.isUber(Job.java:1867)
	at org.apache.hadoop.mapreduce.Job.monitorAndPrintJob(Job.java:1748)
	at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1699)
	at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:580)
	at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:629)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:82)
	at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:602)
2026-05-25 16:41:07,266 INFO o.a.h.m.Job [main] Job job_local994142331_0001 running in uber mode : false
2026-05-25 16:41:07,266 INFO o.a.h.m.Job [main]  map 0% reduce 0%
2026-05-25 16:41:07,267 DEBUG o.a.h.s.UserGroupInformation [main] PrivilegedAction [as: lfoppiano (auth:SIMPLE)][action: org.apache.hadoop.mapreduce.Job$6@5e746d37]
java.lang.Exception
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1896)
	at org.apache.hadoop.mapreduce.Job.getTaskCompletionEvents(Job.java:731)
	at org.apache.hadoop.mapreduce.Job.monitorAndPrintJob(Job.java:1760)
	at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1699)
	at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:580)
	at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:629)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:82)
	at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:602)
2026-05-25 16:41:07,268 DEBUG o.a.h.s.UserGroupInformation [main] PrivilegedAction [as: lfoppiano (auth:SIMPLE)][action: org.apache.hadoop.mapreduce.Job$1@4016ccc1]
java.lang.Exception
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1896)
	at org.apache.hadoop.mapreduce.Job.updateStatus(Job.java:330)
	at org.apache.hadoop.mapreduce.Job.isComplete(Job.java:614)
	at org.apache.hadoop.mapreduce.Job.monitorAndPrintJob(Job.java:1737)
	at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1699)
	at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:580)
	at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:629)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:82)
	at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:602)
2026-05-25 16:41:07,268 DEBUG o.a.h.s.UserGroupInformation [main] PrivilegedAction [as: lfoppiano (auth:SIMPLE)][action: org.apache.hadoop.mapreduce.Job$1@3ffb3598]
java.lang.Exception
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1896)
	at org.apache.hadoop.mapreduce.Job.updateStatus(Job.java:330)
	at org.apache.hadoop.mapreduce.Job.isComplete(Job.java:614)
	at org.apache.hadoop.mapreduce.Job.monitorAndPrintJob(Job.java:1738)
	at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1699)
	at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:580)
	at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:629)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:82)
	at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:602)
2026-05-25 16:41:07,412 DEBUG o.a.t.p.e.ExternalParser [FetcherThread] exit value for convert: 1
2026-05-25 16:41:07,421 INFO o.a.n.f.Fetcher [LocalJobRunner Map Task Executor #0] -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0, fetchQueues.getQueueCount=1
2026-05-25 16:41:07,523 INFO o.a.n.p.RobotRulesParser [FetcherThread] Checking robots.txt for the following agent names: [nutch-test-bot]
2026-05-25 16:41:07,523 INFO o.a.n.p.RobotRulesParser [FetcherThread] Following max. 5 robots.txt redirects
2026-05-25 16:41:07,523 DEBUG o.a.n.p.RobotRulesParser [FetcherThread] robots.txt allowlist not configured.
2026-05-25 16:41:07,569 INFO o.a.n.p.o.OkHttp [FetcherThread] http.proxy.host = null
2026-05-25 16:41:07,569 INFO o.a.n.p.o.OkHttp [FetcherThread] http.proxy.port = 8080
2026-05-25 16:41:07,569 INFO o.a.n.p.o.OkHttp [FetcherThread] http.proxy.exception.list = false
2026-05-25 16:41:07,569 INFO o.a.n.p.o.OkHttp [FetcherThread] http.timeout = 45000 ms
2026-05-25 16:41:07,569 INFO o.a.n.p.o.OkHttp [FetcherThread] http.time.limit = 300 seconds
2026-05-25 16:41:07,569 INFO o.a.n.p.o.OkHttp [FetcherThread] http.content.limit = 5242880 bytes
2026-05-25 16:41:07,569 INFO o.a.n.p.o.OkHttp [FetcherThread] http.agent = Nutch-test-bot/Nutch-1.22-SNAPSHOT
2026-05-25 16:41:07,569 INFO o.a.n.p.o.OkHttp [FetcherThread] http.accept.language = en-US,en;q=0.5
2026-05-25 16:41:07,569 INFO o.a.n.p.o.OkHttp [FetcherThread] http.accept = text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
2026-05-25 16:41:07,569 INFO o.a.n.p.o.OkHttp [FetcherThread] http.enable.cookie.header = false
2026-05-25 16:41:07,600 INFO o.a.n.p.o.IPFilterRules [FetcherThread] Found 1 IP filter rules for http.filter.ipaddress.exclude
2026-05-25 16:41:07,601 INFO o.a.n.p.o.OkHttp [FetcherThread] Using 128 connection pools with max. 256 idle connections and 300 sec. connection keep-alive time
2026-05-25 16:41:08,022 DEBUG o.a.n.p.o.OkHttpResponse [FetcherThread] https://www.h.country/robots.txt - h2 301 
2026-05-25 16:41:08,022 DEBUG o.a.n.p.o.OkHttpResponse [FetcherThread] total bytes requested = 8192, buffered = 167
2026-05-25 16:41:08,023 DEBUG o.a.n.p.o.OkHttpResponse [FetcherThread] source exhausted, no more data to read
2026-05-25 16:41:08,023 DEBUG o.a.n.p.o.OkHttpResponse [FetcherThread] copied 167 bytes out of 167 buffered, remaining 0 bytes in buffer
2026-05-25 16:41:08,054 DEBUG o.a.n.p.h.a.HttpRobotRulesParser [FetcherThread] Following robots.txt redirect: https://www.h.country/robots.txt -> https://harmony.one/human
2026-05-25 16:41:08,269 DEBUG o.a.h.s.UserGroupInformation [main] PrivilegedAction [as: lfoppiano (auth:SIMPLE)][action: org.apache.hadoop.mapreduce.Job$6@3b2f4a93]
java.lang.Exception
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1896)
	at org.apache.hadoop.mapreduce.Job.getTaskCompletionEvents(Job.java:731)
	at org.apache.hadoop.mapreduce.Job.monitorAndPrintJob(Job.java:1760)
	at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1699)
	at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:580)
	at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:629)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:82)
	at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:602)
2026-05-25 16:41:08,269 DEBUG o.a.h.s.UserGroupInformation [main] PrivilegedAction [as: lfoppiano (auth:SIMPLE)][action: org.apache.hadoop.mapreduce.Job$1@470a659f]
java.lang.Exception
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1896)
	at org.apache.hadoop.mapreduce.Job.updateStatus(Job.java:330)
	at org.apache.hadoop.mapreduce.Job.isComplete(Job.java:614)
	at org.apache.hadoop.mapreduce.Job.monitorAndPrintJob(Job.java:1737)
	at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1699)
	at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:580)
	at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:629)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:82)
	at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:602)
2026-05-25 16:41:08,269 DEBUG o.a.h.s.UserGroupInformation [main] PrivilegedAction [as: lfoppiano (auth:SIMPLE)][action: org.apache.hadoop.mapreduce.Job$1@4a23350]
java.lang.Exception
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1896)
	at org.apache.hadoop.mapreduce.Job.updateStatus(Job.java:330)
	at org.apache.hadoop.mapreduce.Job.isComplete(Job.java:614)
	at org.apache.hadoop.mapreduce.Job.monitorAndPrintJob(Job.java:1738)
	at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1699)
	at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:580)
	at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:629)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:82)
	at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:602)
2026-05-25 16:41:08,426 INFO o.a.n.f.Fetcher [LocalJobRunner Map Task Executor #0] -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0, fetchQueues.getQueueCount=1
2026-05-25 16:41:08,441 DEBUG o.a.n.p.o.OkHttpResponse [FetcherThread] https://harmony.one/human - h2 301 
2026-05-25 16:41:08,441 DEBUG o.a.n.p.o.OkHttpResponse [FetcherThread] total bytes requested = 8192, buffered = 0
2026-05-25 16:41:08,444 DEBUG o.a.n.p.o.OkHttpResponse [FetcherThread] source exhausted, no more data to read
2026-05-25 16:41:08,444 DEBUG o.a.n.p.o.OkHttpResponse [FetcherThread] copied 0 bytes out of 0 buffered, remaining 0 bytes in buffer
2026-05-25 16:41:08,446 DEBUG o.a.n.p.h.a.HttpRobotRulesParser [FetcherThread] Following robots.txt redirect: https://harmony.one/human -> https://🧠.s.country/p/human-protocol-aligning-hearts-bots
2026-05-25 16:41:09,043 DEBUG o.a.n.p.o.OkHttpResponse [FetcherThread] https://🧠.s.country/p/human-protocol-aligning-hearts-bots - h2 404 
2026-05-25 16:41:09,044 DEBUG o.a.n.p.o.OkHttpResponse [FetcherThread] total bytes requested = 8192, buffered = 415
2026-05-25 16:41:09,044 DEBUG o.a.n.p.o.OkHttpResponse [FetcherThread] source exhausted, no more data to read
2026-05-25 16:41:09,044 DEBUG o.a.n.p.o.OkHttpResponse [FetcherThread] copied 415 bytes out of 415 buffered, remaining 0 bytes in buffer
2026-05-25 16:41:09,064 DEBUG o.a.n.p.h.a.HttpRobotRulesParser [FetcherThread] Fetched robots.txt for https://www.h.country/ with status code 404
2026-05-25 16:41:09,064 DEBUG o.a.n.f.FetcherThread [FetcherThread] Fetched and stored robots.txt https://www.h.country/robots.txt
2026-05-25 16:41:09,064 DEBUG o.a.n.f.FetcherThread [FetcherThread] Fetched and stored robots.txt https://harmony.one/human
2026-05-25 16:41:09,065 DEBUG o.a.n.f.FetcherThread [FetcherThread] Fetched and stored robots.txt https://🧠.s.country/p/human-protocol-aligning-hearts-bots
2026-05-25 16:41:09,103 DEBUG o.a.n.p.o.OkHttpResponse [FetcherThread] https://www.h.country/ - h2 301 
2026-05-25 16:41:09,104 DEBUG o.a.n.p.o.OkHttpResponse [FetcherThread] total bytes requested = 8192, buffered = 167
2026-05-25 16:41:09,104 DEBUG o.a.n.p.o.OkHttpResponse [FetcherThread] source exhausted, no more data to read
2026-05-25 16:41:09,104 DEBUG o.a.n.p.o.OkHttpResponse [FetcherThread] copied 167 bytes out of 167 buffered, remaining 0 bytes in buffer
2026-05-25 16:41:09,110 DEBUG o.a.n.u.a.RegexURLFilterBase [FetcherThread] Applying rule [^(?:file|ftp|mailto):] for host null and domain null
2026-05-25 16:41:09,110 DEBUG o.a.n.u.a.RegexURLFilterBase [FetcherThread] Applying rule [(?i)\.(?:gif|jpg|png|ico|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|exe|jpeg|bmp|js)$] for host null and domain null
2026-05-25 16:41:09,111 DEBUG o.a.n.u.a.RegexURLFilterBase [FetcherThread] Applying rule [[?*!@=]] for host null and domain null
2026-05-25 16:41:09,111 DEBUG o.a.n.u.a.RegexURLFilterBase [FetcherThread] Applying rule [.*(/[^/]+)/[^/]+\1/[^/]+\1/] for host null and domain null
2026-05-25 16:41:09,111 DEBUG o.a.n.u.a.RegexURLFilterBase [FetcherThread] Applying rule [.] for host null and domain null
2026-05-25 16:41:09,111 DEBUG o.a.n.f.FetcherThread [FetcherThread]  - protocol redirect to https://harmony.one/human (fetching now)
2026-05-25 16:41:09,113 INFO o.a.n.f.FetcherThread [FetcherThread] FetcherThread 53 fetching https://harmony.one/human (queue crawl delay=3600ms)
2026-05-25 16:41:09,113 DEBUG o.a.n.f.FetcherThread [FetcherThread] redirectCount=1
2026-05-25 16:41:09,274 DEBUG o.a.h.s.UserGroupInformation [main] PrivilegedAction [as: lfoppiano (auth:SIMPLE)][action: org.apache.hadoop.mapreduce.Job$6@6fca5907]
java.lang.Exception
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1896)
	at org.apache.hadoop.mapreduce.Job.getTaskCompletionEvents(Job.java:731)
	at org.apache.hadoop.mapreduce.Job.monitorAndPrintJob(Job.java:1760)
	at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1699)
	at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:580)
	at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:629)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:82)
	at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:602)
2026-05-25 16:41:09,275 DEBUG o.a.h.s.UserGroupInformation [main] PrivilegedAction [as: lfoppiano (auth:SIMPLE)][action: org.apache.hadoop.mapreduce.Job$1@7bebcd65]
java.lang.Exception
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1896)
	at org.apache.hadoop.mapreduce.Job.updateStatus(Job.java:330)
	at org.apache.hadoop.mapreduce.Job.isComplete(Job.java:614)
	at org.apache.hadoop.mapreduce.Job.monitorAndPrintJob(Job.java:1737)
	at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1699)
	at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:580)
	at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:629)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:82)
	at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:602)
2026-05-25 16:41:09,275 DEBUG o.a.h.s.UserGroupInformation [main] PrivilegedAction [as: lfoppiano (auth:SIMPLE)][action: org.apache.hadoop.mapreduce.Job$1@7afb1741]
java.lang.Exception
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1896)
	at org.apache.hadoop.mapreduce.Job.updateStatus(Job.java:330)
	at org.apache.hadoop.mapreduce.Job.isComplete(Job.java:614)
	at org.apache.hadoop.mapreduce.Job.monitorAndPrintJob(Job.java:1738)
	at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1699)
	at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:580)
	at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:629)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:82)
	at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:602)
2026-05-25 16:41:09,432 INFO o.a.n.f.Fetcher [LocalJobRunner Map Task Executor #0] -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0, fetchQueues.getQueueCount=2
2026-05-25 16:41:09,487 DEBUG o.a.n.p.o.OkHttpResponse [FetcherThread] https://harmony.one/robots.txt - h2 200 
2026-05-25 16:41:09,519 DEBUG o.a.n.p.o.OkHttpResponse [FetcherThread] total bytes requested = 8192, buffered = 65
2026-05-25 16:41:09,519 DEBUG o.a.n.p.o.OkHttpResponse [FetcherThread] source exhausted, no more data to read
2026-05-25 16:41:09,519 DEBUG o.a.n.p.o.OkHttpResponse [FetcherThread] copied 65 bytes out of 65 buffered, remaining 0 bytes in buffer
2026-05-25 16:41:09,524 DEBUG o.a.n.p.h.a.HttpRobotRulesParser [FetcherThread] Fetched robots.txt for https://harmony.one/human with status code 200
2026-05-25 16:41:09,535 DEBUG o.a.n.f.FetcherThread [FetcherThread] Fetched and stored robots.txt https://harmony.one/robots.txt
2026-05-25 16:41:09,568 DEBUG o.a.n.p.o.OkHttpResponse [FetcherThread] https://harmony.one/human - h2 301 
2026-05-25 16:41:09,568 DEBUG o.a.n.p.o.OkHttpResponse [FetcherThread] total bytes requested = 8192, buffered = 0
2026-05-25 16:41:09,568 DEBUG o.a.n.p.o.OkHttpResponse [FetcherThread] source exhausted, no more data to read
2026-05-25 16:41:09,568 DEBUG o.a.n.p.o.OkHttpResponse [FetcherThread] copied 0 bytes out of 0 buffered, remaining 0 bytes in buffer
2026-05-25 16:41:09,569 DEBUG o.a.n.u.a.RegexURLFilterBase [FetcherThread] Applying rule [^(?:file|ftp|mailto):] for host null and domain null
2026-05-25 16:41:09,569 DEBUG o.a.n.u.a.RegexURLFilterBase [FetcherThread] Applying rule [(?i)\.(?:gif|jpg|png|ico|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|exe|jpeg|bmp|js)$] for host null and domain null
2026-05-25 16:41:09,569 DEBUG o.a.n.u.a.RegexURLFilterBase [FetcherThread] Applying rule [[?*!@=]] for host null and domain null
2026-05-25 16:41:09,569 DEBUG o.a.n.u.a.RegexURLFilterBase [FetcherThread] Applying rule [.*(/[^/]+)/[^/]+\1/[^/]+\1/] for host null and domain null
2026-05-25 16:41:09,570 DEBUG o.a.n.u.a.RegexURLFilterBase [FetcherThread] Applying rule [.] for host null and domain null
2026-05-25 16:41:09,570 DEBUG o.a.n.f.FetcherThread [FetcherThread]  - protocol redirect to https://xn--qv9h.s.country/p/human-protocol-aligning-hearts-bots (fetching now)
2026-05-25 16:41:09,570 INFO o.a.n.f.FetcherThread [FetcherThread] FetcherThread 53 fetching https://xn--qv9h.s.country/p/human-protocol-aligning-hearts-bots (queue crawl delay=3600ms)
2026-05-25 16:41:09,570 DEBUG o.a.n.f.FetcherThread [FetcherThread] redirectCount=2
2026-05-25 16:41:09,913 DEBUG o.a.n.p.o.OkHttpResponse [FetcherThread] https://xn--qv9h.s.country/robots.txt - h2 404 
2026-05-25 16:41:09,914 DEBUG o.a.n.p.o.OkHttpResponse [FetcherThread] total bytes requested = 8192, buffered = 415
2026-05-25 16:41:09,914 DEBUG o.a.n.p.o.OkHttpResponse [FetcherThread] source exhausted, no more data to read
2026-05-25 16:41:09,914 DEBUG o.a.n.p.o.OkHttpResponse [FetcherThread] copied 415 bytes out of 415 buffered, remaining 0 bytes in buffer
2026-05-25 16:41:09,920 DEBUG o.a.n.p.h.a.HttpRobotRulesParser [FetcherThread] Fetched robots.txt for https://xn--qv9h.s.country/p/human-protocol-aligning-hearts-bots with status code 404
2026-05-25 16:41:09,921 DEBUG o.a.n.f.FetcherThread [FetcherThread] Fetched and stored robots.txt https://xn--qv9h.s.country/robots.txt
2026-05-25 16:41:09,981 DEBUG o.a.n.p.o.OkHttpResponse [FetcherThread] https://xn--qv9h.s.country/p/human-protocol-aligning-hearts-bots - h2 404 
2026-05-25 16:41:09,981 DEBUG o.a.n.p.o.OkHttpResponse [FetcherThread] total bytes requested = 8192, buffered = 415
2026-05-25 16:41:09,981 DEBUG o.a.n.p.o.OkHttpResponse [FetcherThread] source exhausted, no more data to read
2026-05-25 16:41:09,981 DEBUG o.a.n.p.o.OkHttpResponse [FetcherThread] copied 415 bytes out of 415 buffered, remaining 0 bytes in buffer
2026-05-25 16:41:09,993 INFO o.a.n.f.FetcherThread [FetcherThread] FetcherThread 53 has no more work available
2026-05-25 16:41:09,994 INFO o.a.n.f.FetcherThread [FetcherThread] FetcherThread 53 -finishing thread FetcherThread, activeThreads=0
2026-05-25 16:41:10,281 DEBUG o.a.h.s.UserGroupInformation [main] PrivilegedAction [as: lfoppiano (auth:SIMPLE)][action: org.apache.hadoop.mapreduce.Job$6@31edeac]
java.lang.Exception
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1896)
	at org.apache.hadoop.mapreduce.Job.getTaskCompletionEvents(Job.java:731)
	at org.apache.hadoop.mapreduce.Job.monitorAndPrintJob(Job.java:1760)
	at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1699)
	at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:580)
	at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:629)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:82)
	at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:602)
2026-05-25 16:41:10,281 DEBUG o.a.h.s.UserGroupInformation [main] PrivilegedAction [as: lfoppiano (auth:SIMPLE)][action: org.apache.hadoop.mapreduce.Job$1@45bb2aa1]
java.lang.Exception
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1896)
	at org.apache.hadoop.mapreduce.Job.updateStatus(Job.java:330)
	at org.apache.hadoop.mapreduce.Job.isComplete(Job.java:614)
	at org.apache.hadoop.mapreduce.Job.monitorAndPrintJob(Job.java:1737)
	at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1699)
	at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:580)
	at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:629)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:82)
	at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:602)
2026-05-25 16:41:10,282 DEBUG o.a.h.s.UserGroupInformation [main] PrivilegedAction [as: lfoppiano (auth:SIMPLE)][action: org.apache.hadoop.mapreduce.Job$1@4b1a43d8]
java.lang.Exception
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1896)
	at org.apache.hadoop.mapreduce.Job.updateStatus(Job.java:330)
	at org.apache.hadoop.mapreduce.Job.isComplete(Job.java:614)
	at org.apache.hadoop.mapreduce.Job.monitorAndPrintJob(Job.java:1738)
	at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1699)
	at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:580)
	at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:629)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:82)
	at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:602)
2026-05-25 16:41:10,437 INFO o.a.n.f.Fetcher [LocalJobRunner Map Task Executor #0] -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0, fetchQueues.getQueueCount=0
2026-05-25 16:41:10,438 INFO o.a.n.f.Fetcher [LocalJobRunner Map Task Executor #0] -activeThreads=0
2026-05-25 16:41:10,447 INFO o.a.h.m.LocalJobRunner [LocalJobRunner Map Task Executor #0] 
2026-05-25 16:41:10,451 INFO o.a.h.m.MapTask [LocalJobRunner Map Task Executor #0] Starting flush of map output
2026-05-25 16:41:10,451 INFO o.a.h.m.MapTask [LocalJobRunner Map Task Executor #0] Spilling map output
2026-05-25 16:41:10,451 INFO o.a.h.m.MapTask [LocalJobRunner Map Task Executor #0] bufstart = 0; bufend = 16638; bufvoid = 104857600
2026-05-25 16:41:10,451 INFO o.a.h.m.MapTask [LocalJobRunner Map Task Executor #0] kvstart = 26214396(104857584); kvend = 26214356(104857424); length = 41/6553600
2026-05-25 16:41:10,479 DEBUG o.a.h.f.s.i.IOStatisticsContextIntegration [LocalJobRunner Map Task Executor #0] Reference lost for threadID for the context: 49
2026-05-25 16:41:10,479 DEBUG o.a.h.f.s.i.IOStatisticsContextIntegration [LocalJobRunner Map Task Executor #0] Created instance IOStatisticsContextImpl{id=5, threadId=49, ioStatistics=counters=();
gauges=();
minimums=();
maximums=();
means=();
}
2026-05-25 16:41:10,488 INFO o.a.h.m.MapTask [LocalJobRunner Map Task Executor #0] Finished spill 0
2026-05-25 16:41:10,500 INFO o.a.h.m.Task [LocalJobRunner Map Task Executor #0] Task:attempt_local994142331_0001_m_000000_0 is done. And is in the process of committing
2026-05-25 16:41:10,502 INFO o.a.h.m.LocalJobRunner [LocalJobRunner Map Task Executor #0] 0 threads (0 waiting), 0 queues, 0 URLs queued, 0 pages, 0 errors, 0.00 pages/s (0 last sec), 0 kbits/s (0 last sec)
2026-05-25 16:41:10,502 INFO o.a.h.m.Task [LocalJobRunner Map Task Executor #0] Task 'attempt_local994142331_0001_m_000000_0' done.
2026-05-25 16:41:10,505 INFO o.a.h.m.Task [LocalJobRunner Map Task Executor #0] Final Counters for attempt_local994142331_0001_m_000000_0: Counters: 47

this branch:

2026-05-25 16:45:46,917 INFO o.a.n.f.FetcherThread [LocalJobRunner Map Task Executor #0] FetcherThread 50 Using queue mode : byHost
2026-05-25 16:45:46,923 INFO o.a.n.f.FetcherThread [FetcherThread] FetcherThread 54 fetching https://www.h.country/ (queue crawl delay=3600ms)
2026-05-25 16:45:46,923 DEBUG o.a.n.f.FetcherThread [FetcherThread] redirectCount=0
2026-05-25 16:45:46,924 INFO o.a.h.c.Configuration [LocalJobRunner Map Task Executor #0] found resource host-protocol-mapping.txt at file:/Users/lfoppiano/development/projects/cc/nutch/runtime/local/conf/host-protocol-mapping.txt
2026-05-25 16:45:46,925 INFO o.a.n.f.FetcherThread [LocalJobRunner Map Task Executor #0] FetcherThread 50 Using queue mode : byHost
2026-05-25 16:45:46,926 INFO o.a.n.f.FetcherThread [FetcherThread] FetcherThread 55 has no more work available
2026-05-25 16:45:46,927 INFO o.a.h.c.Configuration [LocalJobRunner Map Task Executor #0] found resource host-protocol-mapping.txt at file:/Users/lfoppiano/development/projects/cc/nutch/runtime/local/conf/host-protocol-mapping.txt
2026-05-25 16:45:46,927 INFO o.a.n.f.FetcherThread [LocalJobRunner Map Task Executor #0] FetcherThread 50 Using queue mode : byHost
2026-05-25 16:45:46,927 INFO o.a.n.f.FetcherThread [FetcherThread] FetcherThread 55 -finishing thread FetcherThread, activeThreads=1
2026-05-25 16:45:46,928 INFO o.a.n.f.FetcherThread [FetcherThread] FetcherThread 56 has no more work available
2026-05-25 16:45:46,928 INFO o.a.n.f.FetcherThread [FetcherThread] FetcherThread 56 -finishing thread FetcherThread, activeThreads=1
2026-05-25 16:45:46,929 INFO o.a.h.c.Configuration [LocalJobRunner Map Task Executor #0] found resource host-protocol-mapping.txt at file:/Users/lfoppiano/development/projects/cc/nutch/runtime/local/conf/host-protocol-mapping.txt
2026-05-25 16:45:46,929 INFO o.a.n.f.FetcherThread [LocalJobRunner Map Task Executor #0] FetcherThread 50 Using queue mode : byHost
2026-05-25 16:45:46,930 INFO o.a.n.f.FetcherThread [FetcherThread] FetcherThread 57 has no more work available
2026-05-25 16:45:46,930 INFO o.a.n.f.FetcherThread [FetcherThread] FetcherThread 57 -finishing thread FetcherThread, activeThreads=1
2026-05-25 16:45:46,931 INFO o.a.h.c.Configuration [LocalJobRunner Map Task Executor #0] found resource host-protocol-mapping.txt at file:/Users/lfoppiano/development/projects/cc/nutch/runtime/local/conf/host-protocol-mapping.txt
2026-05-25 16:45:46,931 INFO o.a.n.f.FetcherThread [LocalJobRunner Map Task Executor #0] FetcherThread 50 Using queue mode : byHost
2026-05-25 16:45:46,932 INFO o.a.n.f.FetcherThread [FetcherThread] FetcherThread 58 has no more work available
2026-05-25 16:45:46,932 INFO o.a.n.f.FetcherThread [FetcherThread] FetcherThread 58 -finishing thread FetcherThread, activeThreads=1
2026-05-25 16:45:46,933 INFO o.a.h.c.Configuration [LocalJobRunner Map Task Executor #0] found resource host-protocol-mapping.txt at file:/Users/lfoppiano/development/projects/cc/nutch/runtime/local/conf/host-protocol-mapping.txt
2026-05-25 16:45:46,933 INFO o.a.n.f.FetcherThread [LocalJobRunner Map Task Executor #0] FetcherThread 50 Using queue mode : byHost
2026-05-25 16:45:46,934 INFO o.a.n.f.FetcherThread [FetcherThread] FetcherThread 59 has no more work available
2026-05-25 16:45:46,934 INFO o.a.n.f.FetcherThread [FetcherThread] FetcherThread 59 -finishing thread FetcherThread, activeThreads=1
2026-05-25 16:45:46,935 INFO o.a.h.c.Configuration [LocalJobRunner Map Task Executor #0] found resource host-protocol-mapping.txt at file:/Users/lfoppiano/development/projects/cc/nutch/runtime/local/conf/host-protocol-mapping.txt
2026-05-25 16:45:46,935 INFO o.a.n.f.FetcherThread [LocalJobRunner Map Task Executor #0] FetcherThread 50 Using queue mode : byHost
2026-05-25 16:45:46,935 INFO o.a.n.f.FetcherThread [FetcherThread] FetcherThread 60 has no more work available
2026-05-25 16:45:46,936 INFO o.a.n.f.FetcherThread [FetcherThread] FetcherThread 60 -finishing thread FetcherThread, activeThreads=1
2026-05-25 16:45:46,937 INFO o.a.h.c.Configuration [LocalJobRunner Map Task Executor #0] found resource host-protocol-mapping.txt at file:/Users/lfoppiano/development/projects/cc/nutch/runtime/local/conf/host-protocol-mapping.txt
2026-05-25 16:45:46,937 INFO o.a.n.f.FetcherThread [LocalJobRunner Map Task Executor #0] FetcherThread 50 Using queue mode : byHost
2026-05-25 16:45:46,937 INFO o.a.n.f.FetcherThread [FetcherThread] FetcherThread 61 has no more work available
2026-05-25 16:45:46,937 INFO o.a.n.f.FetcherThread [FetcherThread] FetcherThread 61 -finishing thread FetcherThread, activeThreads=1
2026-05-25 16:45:46,939 INFO o.a.h.c.Configuration [LocalJobRunner Map Task Executor #0] found resource host-protocol-mapping.txt at file:/Users/lfoppiano/development/projects/cc/nutch/runtime/local/conf/host-protocol-mapping.txt
2026-05-25 16:45:46,939 INFO o.a.n.f.FetcherThread [LocalJobRunner Map Task Executor #0] FetcherThread 50 Using queue mode : byHost
2026-05-25 16:45:46,939 INFO o.a.n.f.FetcherThread [FetcherThread] FetcherThread 62 has no more work available
2026-05-25 16:45:46,939 INFO o.a.n.f.FetcherThread [FetcherThread] FetcherThread 62 -finishing thread FetcherThread, activeThreads=1
2026-05-25 16:45:46,941 INFO o.a.h.c.Configuration [LocalJobRunner Map Task Executor #0] found resource host-protocol-mapping.txt at file:/Users/lfoppiano/development/projects/cc/nutch/runtime/local/conf/host-protocol-mapping.txt
2026-05-25 16:45:46,941 INFO o.a.n.f.FetcherThread [LocalJobRunner Map Task Executor #0] FetcherThread 50 Using queue mode : byHost
2026-05-25 16:45:46,941 INFO o.a.n.f.Fetcher [LocalJobRunner Map Task Executor #0] Fetcher: throughput threshold: 4
2026-05-25 16:45:46,941 INFO o.a.n.f.Fetcher [LocalJobRunner Map Task Executor #0] Fetcher: throughput threshold retries: 360
2026-05-25 16:45:46,941 INFO o.a.n.f.FetcherThread [FetcherThread] FetcherThread 63 has no more work available
2026-05-25 16:45:46,941 INFO o.a.n.f.FetcherThread [FetcherThread] FetcherThread 63 -finishing thread FetcherThread, activeThreads=1
2026-05-25 16:45:46,943 INFO c.d.EffectiveTldFinder [FetcherThread] Loading public suffix list from class path: file:/Users/lfoppiano/development/projects/cc/nutch/runtime/local/conf/effective_tld_names.dat
2026-05-25 16:45:46,951 INFO c.d.EffectiveTldFinder [FetcherThread] Public suffix list VERSION: 2026-02-22_12-29-42_UTC
2026-05-25 16:45:46,951 INFO c.d.EffectiveTldFinder [FetcherThread] Public suffix list COMMIT: 35aa65b9d2ea34cdfa7ece1432603181cf2258de
2026-05-25 16:45:47,010 INFO c.d.EffectiveTldFinder [FetcherThread] Successfully read public suffix list: 330807 bytes, 16285 lines, 10142 rules
2026-05-25 16:45:47,012 INFO c.d.EffectiveTldFinder [FetcherThread] Digest of public suffix list: MD5 = 6A58B22E28487D9A988DDBE0922A9365
2026-05-25 16:45:47,012 INFO c.d.EffectiveTldFinder [FetcherThread] Digest of public suffix list: SHA-512 = 854CA933175CBC2E299BBF1EF6FB16FCC4DF76A7628F39C639A93F7D3EBE49F0F35F218D86FA2FB7CF963321C67AA7FC2232D92521CB4E009F359921397FF495
2026-05-25 16:45:47,037 DEBUG o.a.t.c.TikaConfig [FetcherThread] loading tika config from defaults; no config file specified
2026-05-25 16:45:47,196 DEBUG o.a.t.p.e.ExternalParser [FetcherThread] exit value for tesseract: 1
2026-05-25 16:45:47,196 DEBUG o.a.t.p.o.TesseractOCRParser [FetcherThread] hasTesseract (path: [tesseract]): true
2026-05-25 16:45:47,218 DEBUG o.a.t.p.e.ExternalParser [FetcherThread] exit value for convert: 1
2026-05-25 16:45:47,318 INFO o.a.n.p.RobotRulesParser [FetcherThread] Checking robots.txt for the following agent names: [nutch-test-bot]
2026-05-25 16:45:47,318 INFO o.a.n.p.RobotRulesParser [FetcherThread] Following max. 5 robots.txt redirects
2026-05-25 16:45:47,318 DEBUG o.a.n.p.RobotRulesParser [FetcherThread] robots.txt allowlist not configured.
2026-05-25 16:45:47,364 INFO o.a.n.p.o.OkHttp [FetcherThread] http.proxy.host = null
2026-05-25 16:45:47,364 INFO o.a.n.p.o.OkHttp [FetcherThread] http.proxy.port = 8080
2026-05-25 16:45:47,364 INFO o.a.n.p.o.OkHttp [FetcherThread] http.proxy.exception.list = false
2026-05-25 16:45:47,364 INFO o.a.n.p.o.OkHttp [FetcherThread] http.timeout = 45000 ms
2026-05-25 16:45:47,364 INFO o.a.n.p.o.OkHttp [FetcherThread] http.time.limit = 300 seconds
2026-05-25 16:45:47,364 INFO o.a.n.p.o.OkHttp [FetcherThread] http.content.limit = 5242880 bytes
2026-05-25 16:45:47,365 INFO o.a.n.p.o.OkHttp [FetcherThread] http.agent = Nutch-test-bot/Nutch-1.22-SNAPSHOT
2026-05-25 16:45:47,365 INFO o.a.n.p.o.OkHttp [FetcherThread] http.accept.language = en-US,en;q=0.5
2026-05-25 16:45:47,365 INFO o.a.n.p.o.OkHttp [FetcherThread] http.accept = text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
2026-05-25 16:45:47,365 INFO o.a.n.p.o.OkHttp [FetcherThread] http.enable.cookie.header = false
2026-05-25 16:45:47,395 INFO o.a.n.p.o.IPFilterRules [FetcherThread] Found 1 IP filter rules for http.filter.ipaddress.exclude
2026-05-25 16:45:47,397 INFO o.a.n.p.o.OkHttp [FetcherThread] Using 128 connection pools with max. 256 idle connections and 300 sec. connection keep-alive time
2026-05-25 16:45:47,727 DEBUG o.a.n.p.o.OkHttpResponse [FetcherThread] https://www.h.country/robots.txt - h2 301 
2026-05-25 16:45:47,727 DEBUG o.a.n.p.o.OkHttpResponse [FetcherThread] total bytes requested = 8192, buffered = 167
2026-05-25 16:45:47,727 DEBUG o.a.n.p.o.OkHttpResponse [FetcherThread] source exhausted, no more data to read
2026-05-25 16:45:47,727 DEBUG o.a.n.p.o.OkHttpResponse [FetcherThread] copied 167 bytes out of 167 buffered, remaining 0 bytes in buffer
2026-05-25 16:45:47,756 DEBUG o.a.n.p.h.a.HttpRobotRulesParser [FetcherThread] Following robots.txt redirect: https://www.h.country/robots.txt -> https://harmony.one/human
2026-05-25 16:45:47,795 DEBUG o.a.h.s.UserGroupInformation [main] PrivilegedAction [as: lfoppiano (auth:SIMPLE)][action: org.apache.hadoop.mapreduce.Job$1@3ec2ecea]
java.lang.Exception
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1896)
	at org.apache.hadoop.mapreduce.Job.updateStatus(Job.java:330)
	at org.apache.hadoop.mapreduce.Job.isUber(Job.java:1867)
	at org.apache.hadoop.mapreduce.Job.monitorAndPrintJob(Job.java:1748)
	at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1699)
	at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:580)
	at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:629)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:82)
	at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:602)
2026-05-25 16:45:47,795 INFO o.a.h.m.Job [main] Job job_local1576989335_0001 running in uber mode : false
2026-05-25 16:45:47,796 INFO o.a.h.m.Job [main]  map 0% reduce 0%
2026-05-25 16:45:47,796 DEBUG o.a.h.s.UserGroupInformation [main] PrivilegedAction [as: lfoppiano (auth:SIMPLE)][action: org.apache.hadoop.mapreduce.Job$6@7c9bdee9]
java.lang.Exception
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1896)
	at org.apache.hadoop.mapreduce.Job.getTaskCompletionEvents(Job.java:731)
	at org.apache.hadoop.mapreduce.Job.monitorAndPrintJob(Job.java:1760)
	at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1699)
	at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:580)
	at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:629)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:82)
	at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:602)
2026-05-25 16:45:47,797 DEBUG o.a.h.s.UserGroupInformation [main] PrivilegedAction [as: lfoppiano (auth:SIMPLE)][action: org.apache.hadoop.mapreduce.Job$1@2cec704c]
java.lang.Exception
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1896)
	at org.apache.hadoop.mapreduce.Job.updateStatus(Job.java:330)
	at org.apache.hadoop.mapreduce.Job.isComplete(Job.java:614)
	at org.apache.hadoop.mapreduce.Job.monitorAndPrintJob(Job.java:1737)
	at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1699)
	at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:580)
	at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:629)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:82)
	at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:602)
2026-05-25 16:45:47,797 DEBUG o.a.h.s.UserGroupInformation [main] PrivilegedAction [as: lfoppiano (auth:SIMPLE)][action: org.apache.hadoop.mapreduce.Job$1@2416498e]
java.lang.Exception
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1896)
	at org.apache.hadoop.mapreduce.Job.updateStatus(Job.java:330)
	at org.apache.hadoop.mapreduce.Job.isComplete(Job.java:614)
	at org.apache.hadoop.mapreduce.Job.monitorAndPrintJob(Job.java:1738)
	at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1699)
	at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:580)
	at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:629)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:82)
	at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:602)
2026-05-25 16:45:47,946 INFO o.a.n.f.Fetcher [LocalJobRunner Map Task Executor #0] -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0, fetchQueues.getQueueCount=1
2026-05-25 16:45:47,995 DEBUG o.a.n.p.o.OkHttpResponse [FetcherThread] https://harmony.one/human - h2 301 
2026-05-25 16:45:47,995 DEBUG o.a.n.p.o.OkHttpResponse [FetcherThread] total bytes requested = 8192, buffered = 0
2026-05-25 16:45:47,995 DEBUG o.a.n.p.o.OkHttpResponse [FetcherThread] source exhausted, no more data to read
2026-05-25 16:45:47,995 DEBUG o.a.n.p.o.OkHttpResponse [FetcherThread] copied 0 bytes out of 0 buffered, remaining 0 bytes in buffer
2026-05-25 16:45:47,996 DEBUG o.a.n.p.h.a.HttpRobotRulesParser [FetcherThread] Following robots.txt redirect: https://harmony.one/human -> https://🧠.s.country/p/human-protocol-aligning-hearts-bots
2026-05-25 16:45:48,281 DEBUG o.a.n.p.o.OkHttpResponse [FetcherThread] https://xn--qv9h.s.country/p/human-protocol-aligning-hearts-bots - h2 404 
2026-05-25 16:45:48,282 DEBUG o.a.n.p.o.OkHttpResponse [FetcherThread] total bytes requested = 8192, buffered = 415
2026-05-25 16:45:48,282 DEBUG o.a.n.p.o.OkHttpResponse [FetcherThread] source exhausted, no more data to read
2026-05-25 16:45:48,282 DEBUG o.a.n.p.o.OkHttpResponse [FetcherThread] copied 415 bytes out of 415 buffered, remaining 0 bytes in buffer
2026-05-25 16:45:48,292 DEBUG o.a.n.p.h.a.HttpRobotRulesParser [FetcherThread] Fetched robots.txt for https://www.h.country/ with status code 404
2026-05-25 16:45:48,292 DEBUG o.a.n.f.FetcherThread [FetcherThread] Fetched and stored robots.txt https://www.h.country/robots.txt
2026-05-25 16:45:48,292 DEBUG o.a.n.f.FetcherThread [FetcherThread] Fetched and stored robots.txt https://harmony.one/human
2026-05-25 16:45:48,292 DEBUG o.a.n.f.FetcherThread [FetcherThread] Fetched and stored robots.txt https://xn--qv9h.s.country/p/human-protocol-aligning-hearts-bots
2026-05-25 16:45:48,332 DEBUG o.a.n.p.o.OkHttpResponse [FetcherThread] https://www.h.country/ - h2 301 
2026-05-25 16:45:48,335 DEBUG o.a.n.p.o.OkHttpResponse [FetcherThread] total bytes requested = 8192, buffered = 167
2026-05-25 16:45:48,335 DEBUG o.a.n.p.o.OkHttpResponse [FetcherThread] source exhausted, no more data to read
2026-05-25 16:45:48,335 DEBUG o.a.n.p.o.OkHttpResponse [FetcherThread] copied 167 bytes out of 167 buffered, remaining 0 bytes in buffer
2026-05-25 16:45:48,339 DEBUG o.a.n.u.a.RegexURLFilterBase [FetcherThread] Applying rule [^(?:file|ftp|mailto):] for host null and domain null
2026-05-25 16:45:48,339 DEBUG o.a.n.u.a.RegexURLFilterBase [FetcherThread] Applying rule [(?i)\.(?:gif|jpg|png|ico|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|exe|jpeg|bmp|js)$] for host null and domain null
2026-05-25 16:45:48,339 DEBUG o.a.n.u.a.RegexURLFilterBase [FetcherThread] Applying rule [[?*!@=]] for host null and domain null
2026-05-25 16:45:48,339 DEBUG o.a.n.u.a.RegexURLFilterBase [FetcherThread] Applying rule [.*(/[^/]+)/[^/]+\1/[^/]+\1/] for host null and domain null
2026-05-25 16:45:48,339 DEBUG o.a.n.u.a.RegexURLFilterBase [FetcherThread] Applying rule [.] for host null and domain null
2026-05-25 16:45:48,339 DEBUG o.a.n.f.FetcherThread [FetcherThread]  - protocol redirect to https://harmony.one/human (fetching now)
2026-05-25 16:45:48,341 INFO o.a.n.f.FetcherThread [FetcherThread] FetcherThread 54 fetching https://harmony.one/human (queue crawl delay=3600ms)
2026-05-25 16:45:48,341 DEBUG o.a.n.f.FetcherThread [FetcherThread] redirectCount=1
2026-05-25 16:45:48,643 DEBUG o.a.n.p.o.OkHttpResponse [FetcherThread] https://harmony.one/robots.txt - h2 200 
2026-05-25 16:45:48,644 DEBUG o.a.n.p.o.OkHttpResponse [FetcherThread] total bytes requested = 8192, buffered = 65
2026-05-25 16:45:48,644 DEBUG o.a.n.p.o.OkHttpResponse [FetcherThread] source exhausted, no more data to read
2026-05-25 16:45:48,644 DEBUG o.a.n.p.o.OkHttpResponse [FetcherThread] copied 65 bytes out of 65 buffered, remaining 0 bytes in buffer
2026-05-25 16:45:48,656 DEBUG o.a.n.p.h.a.HttpRobotRulesParser [FetcherThread] Fetched robots.txt for https://harmony.one/human with status code 200
2026-05-25 16:45:48,657 DEBUG o.a.n.f.FetcherThread [FetcherThread] Fetched and stored robots.txt https://harmony.one/robots.txt
2026-05-25 16:45:48,683 DEBUG o.a.n.p.o.OkHttpResponse [FetcherThread] https://harmony.one/human - h2 301 
2026-05-25 16:45:48,683 DEBUG o.a.n.p.o.OkHttpResponse [FetcherThread] total bytes requested = 8192, buffered = 0
2026-05-25 16:45:48,683 DEBUG o.a.n.p.o.OkHttpResponse [FetcherThread] source exhausted, no more data to read
2026-05-25 16:45:48,683 DEBUG o.a.n.p.o.OkHttpResponse [FetcherThread] copied 0 bytes out of 0 buffered, remaining 0 bytes in buffer
2026-05-25 16:45:48,685 DEBUG o.a.n.u.a.RegexURLFilterBase [FetcherThread] Applying rule [^(?:file|ftp|mailto):] for host null and domain null
2026-05-25 16:45:48,685 DEBUG o.a.n.u.a.RegexURLFilterBase [FetcherThread] Applying rule [(?i)\.(?:gif|jpg|png|ico|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|exe|jpeg|bmp|js)$] for host null and domain null
2026-05-25 16:45:48,685 DEBUG o.a.n.u.a.RegexURLFilterBase [FetcherThread] Applying rule [[?*!@=]] for host null and domain null
2026-05-25 16:45:48,685 DEBUG o.a.n.u.a.RegexURLFilterBase [FetcherThread] Applying rule [.*(/[^/]+)/[^/]+\1/[^/]+\1/] for host null and domain null
2026-05-25 16:45:48,685 DEBUG o.a.n.u.a.RegexURLFilterBase [FetcherThread] Applying rule [.] for host null and domain null
2026-05-25 16:45:48,685 DEBUG o.a.n.f.FetcherThread [FetcherThread]  - protocol redirect to https://xn--qv9h.s.country/p/human-protocol-aligning-hearts-bots (fetching now)
2026-05-25 16:45:48,686 INFO o.a.n.f.FetcherThread [FetcherThread] FetcherThread 54 fetching https://xn--qv9h.s.country/p/human-protocol-aligning-hearts-bots (queue crawl delay=3600ms)
2026-05-25 16:45:48,686 DEBUG o.a.n.f.FetcherThread [FetcherThread] redirectCount=2
2026-05-25 16:45:48,720 DEBUG o.a.n.p.o.OkHttpResponse [FetcherThread] https://xn--qv9h.s.country/robots.txt - h2 404 
2026-05-25 16:45:48,720 DEBUG o.a.n.p.o.OkHttpResponse [FetcherThread] total bytes requested = 8192, buffered = 415
2026-05-25 16:45:48,720 DEBUG o.a.n.p.o.OkHttpResponse [FetcherThread] source exhausted, no more data to read
2026-05-25 16:45:48,720 DEBUG o.a.n.p.o.OkHttpResponse [FetcherThread] copied 415 bytes out of 415 buffered, remaining 0 bytes in buffer
2026-05-25 16:45:48,722 DEBUG o.a.n.p.h.a.HttpRobotRulesParser [FetcherThread] Fetched robots.txt for https://xn--qv9h.s.country/p/human-protocol-aligning-hearts-bots with status code 404
2026-05-25 16:45:48,722 DEBUG o.a.n.f.FetcherThread [FetcherThread] Fetched and stored robots.txt https://xn--qv9h.s.country/robots.txt
2026-05-25 16:45:48,789 DEBUG o.a.n.p.o.OkHttpResponse [FetcherThread] https://xn--qv9h.s.country/p/human-protocol-aligning-hearts-bots - h2 404 
2026-05-25 16:45:48,790 DEBUG o.a.n.p.o.OkHttpResponse [FetcherThread] total bytes requested = 8192, buffered = 415
2026-05-25 16:45:48,790 DEBUG o.a.n.p.o.OkHttpResponse [FetcherThread] source exhausted, no more data to read
2026-05-25 16:45:48,790 DEBUG o.a.n.p.o.OkHttpResponse [FetcherThread] copied 415 bytes out of 415 buffered, remaining 0 bytes in buffer
2026-05-25 16:45:48,791 INFO o.a.n.f.FetcherThread [FetcherThread] FetcherThread 54 has no more work available
2026-05-25 16:45:48,793 INFO o.a.n.f.FetcherThread [FetcherThread] FetcherThread 54 -finishing thread FetcherThread, activeThreads=0
2026-05-25 16:45:48,802 DEBUG o.a.h.s.UserGroupInformation [main] PrivilegedAction [as: lfoppiano (auth:SIMPLE)][action: org.apache.hadoop.mapreduce.Job$6@240f6c41]
java.lang.Exception
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1896)
	at org.apache.hadoop.mapreduce.Job.getTaskCompletionEvents(Job.java:731)
	at org.apache.hadoop.mapreduce.Job.monitorAndPrintJob(Job.java:1760)
	at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1699)
	at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:580)
	at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:629)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:82)
	at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:602)
2026-05-25 16:45:48,802 DEBUG o.a.h.s.UserGroupInformation [main] PrivilegedAction [as: lfoppiano (auth:SIMPLE)][action: org.apache.hadoop.mapreduce.Job$1@2015b2cd]
java.lang.Exception
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1896)
	at org.apache.hadoop.mapreduce.Job.updateStatus(Job.java:330)
	at org.apache.hadoop.mapreduce.Job.isComplete(Job.java:614)
	at org.apache.hadoop.mapreduce.Job.monitorAndPrintJob(Job.java:1737)
	at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1699)
	at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:580)
	at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:629)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:82)
	at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:602)
2026-05-25 16:45:48,802 DEBUG o.a.h.s.UserGroupInformation [main] PrivilegedAction [as: lfoppiano (auth:SIMPLE)][action: org.apache.hadoop.mapreduce.Job$1@64693226]
java.lang.Exception
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1896)
	at org.apache.hadoop.mapreduce.Job.updateStatus(Job.java:330)
	at org.apache.hadoop.mapreduce.Job.isComplete(Job.java:614)
	at org.apache.hadoop.mapreduce.Job.monitorAndPrintJob(Job.java:1738)
	at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1699)
	at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:580)
	at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:629)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:82)
	at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:602)
2026-05-25 16:45:48,952 INFO o.a.n.f.Fetcher [LocalJobRunner Map Task Executor #0] -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0, fetchQueues.getQueueCount=0
2026-05-25 16:45:48,952 INFO o.a.n.f.Fetcher [LocalJobRunner Map Task Executor #0] -activeThreads=0
2026-05-25 16:45:48,957 INFO o.a.h.m.LocalJobRunner [LocalJobRunner Map Task Executor #0] 
2026-05-25 16:45:48,959 INFO o.a.h.m.MapTask [LocalJobRunner Map Task Executor #0] Starting flush of map output
2026-05-25 16:45:48,960 INFO o.a.h.m.MapTask [LocalJobRunner Map Task Executor #0] Spilling map output
2026-05-25 16:45:48,960 INFO o.a.h.m.MapTask [LocalJobRunner Map Task Executor #0] bufstart = 0; bufend = 16765; bufvoid = 104857600
2026-05-25 16:45:48,960 INFO o.a.h.m.MapTask [LocalJobRunner Map Task Executor #0] kvstart = 26214396(104857584); kvend = 26214356(104857424); length = 41/6553600
2026-05-25 16:45:48,993 DEBUG o.a.h.f.s.i.IOStatisticsContextIntegration [LocalJobRunner Map Task Executor #0] Reference lost for threadID for the context: 50
2026-05-25 16:45:48,993 DEBUG o.a.h.f.s.i.IOStatisticsContextIntegration [LocalJobRunner Map Task Executor #0] Created instance IOStatisticsContextImpl{id=5, threadId=50, ioStatistics=counters=();
gauges=();
minimums=();
maximums=();
means=();
}
2026-05-25 16:45:49,005 INFO o.a.h.m.MapTask [LocalJobRunner Map Task Executor #0] Finished spill 0
2026-05-25 16:45:49,019 INFO o.a.h.m.Task [LocalJobRunner Map Task Executor #0] Task:attempt_local1576989335_0001_m_000000_0 is done. And is in the process of committing
2026-05-25 16:45:49,020 INFO o.a.h.m.LocalJobRunner [LocalJobRunner Map Task Executor #0] 0 threads (0 waiting), 0 queues, 0 URLs queued, 0 pages, 0 errors, 0.00 pages/s (0 last sec), 0 kbits/s (0 last sec)
2026-05-25 16:45:49,020 INFO o.a.h.m.Task [LocalJobRunner Map Task Executor #0] Task 'attempt_local1576989335_0001_m_000000_0' done.
2026-05-25 16:45:49,024 INFO o.a.h.m.Task [LocalJobRunner Map Task Executor #0] Final Counters for attempt_local1576989335_0001_m_000000_0: Counters: 47

I cannot figure out what is going on and why the robots.txt is not visited in this branch.

@sebastian-nagel

why the robots.txt is not visited in this branch.

I cannot see any differences, the robots.txt is visited in both variants:

  1. (unchanged)
    2026-05-25 16:41:09,913 DEBUG o.a.n.p.o.OkHttpResponse [FetcherThread] https://xn--qv9h.s.country/robots.txt - h2 404 
    2026-05-25 16:41:09,921 DEBUG o.a.n.f.FetcherThread [FetcherThread] Fetched and stored robots.txt https://xn--qv9h.s.country/robots.txt
    
  2. (this PR)
    2026-05-25 16:45:48,720 DEBUG o.a.n.p.o.OkHttpResponse [FetcherThread] https://xn--qv9h.s.country/robots.txt - h2 404 
    2026-05-25 16:45:48,722 DEBUG o.a.n.f.FetcherThread [FetcherThread] Fetched and stored robots.txt https://xn--qv9h.s.country/robots.txt
    

@lfoppiano

Author

Sorry, the difference is in the resulting cdx / warc file, where the record

country,s,xn--qv9h)/robots.txt 20260526071921 {"url": "https://xn--qv9h.s.country/robots.txt", "mime": "text/html", "mime-detected": "text/html", "status": "404", "digest": "QHVUZHXT6TSMSHCXASTZN45UWCAFNFNT", "length": "1116", "offset": "3365", "filename": "Users/lfoppiano/development/projects/cc/nutch/crawl_punycodes/segments/20260526081912/robotstxt/NUTCH-CRAWL-20260526071915-20260526101915-00000.warc.gz"}

is not present in our branch.

master/main/cc:

one,harmony)/human 20260526071919 {"url": "https://harmony.one/human", "mime": "unk", "mime-detected": "application/octet-stream", "status": "301", "digest": "3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ", "length": "698", "offset": "826", "filename": "Users/lfoppiano/development/projects/cc/nutch/crawl_punycodes/segments/20260526081912/robotstxt/NUTCH-CRAWL-20260526071915-20260526101915-00000.warc.gz", "redirect": "https://xn--qv9h.s.country/p/human-protocol-aligning-hearts-bots"}
country,h)/robots.txt 20260526071916 {"url": "https://www.h.country/robots.txt", "mime": "text/html", "mime-detected": "text/html", "status": "301", "digest": "OQ3OBNFR7DBCFQ4ANGEQW5P2FOXZZJRA", "length": "895", "offset": "1995", "filename": "Users/lfoppiano/development/projects/cc/nutch/crawl_punycodes/segments/20260526081912/robotstxt/NUTCH-CRAWL-20260526071915-20260526101915-00000.warc.gz", "redirect": "https://harmony.one/human"}
country,s,xn--qv9h)/robots.txt 20260526071921 {"url": "https://xn--qv9h.s.country/robots.txt", "mime": "text/html", "mime-detected": "text/html", "status": "404", "digest": "QHVUZHXT6TSMSHCXASTZN45UWCAFNFNT", "length": "1116", "offset": "3365", "filename": "Users/lfoppiano/development/projects/cc/nutch/crawl_punycodes/segments/20260526081912/robotstxt/NUTCH-CRAWL-20260526071915-20260526101915-00000.warc.gz"}
country,s,%f0%9f%a7%a0)/p/human-protocol-aligning-hearts-bots 20260526071919 {"url": "https://%F0%9F%A7%A0.s.country/p/human-protocol-aligning-hearts-bots", "mime": "text/html", "mime-detected": "text/html", "status": "404", "digest": "QHVUZHXT6TSMSHCXASTZN45UWCAFNFNT", "length": "1137", "offset": "4984", "filename": "Users/lfoppiano/development/projects/cc/nutch/crawl_punycodes/segments/20260526081912/robotstxt/NUTCH-CRAWL-20260526071915-20260526101915-00000.warc.gz"}

this branch:

one,harmony)/human 20260526071720 {"url": "https://harmony.one/human", "mime": "unk", "mime-detected": "application/octet-stream", "status": "301", "digest": "3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ", "length": "704", "offset": "825", "filename": "Users/lfoppiano/development/projects/cc/nutch/crawl_punycodes/segments/20260526081715/robotstxt/NUTCH-CRAWL-20260526071717-20260526101717-00000.warc.gz", "redirect": "https://xn--qv9h.s.country/p/human-protocol-aligning-hearts-bots"}
country,h)/robots.txt 20260526071720 {"url": "https://www.h.country/robots.txt", "mime": "text/html", "mime-detected": "text/html", "status": "301", "digest": "OQ3OBNFR7DBCFQ4ANGEQW5P2FOXZZJRA", "length": "901", "offset": "1998", "filename": "Users/lfoppiano/development/projects/cc/nutch/crawl_punycodes/segments/20260526081715/robotstxt/NUTCH-CRAWL-20260526071717-20260526101717-00000.warc.gz", "redirect": "https://harmony.one/human"}
country,s,xn--qv9h)/p/human-protocol-aligning-hearts-bots 20260526071722 {"url": "https://xn--qv9h.s.country/p/human-protocol-aligning-hearts-bots", "mime": "text/html", "mime-detected": "text/html", "status": "404", "digest": "QHVUZHXT6TSMSHCXASTZN45UWCAFNFNT", "length": "1121", "offset": "3390", "filename": "Users/lfoppiano/development/projects/cc/nutch/crawl_punycodes/segments/20260526081715/robotstxt/NUTCH-CRAWL-20260526071717-20260526101717-00000.warc.gz"}

I probably need to debug it 😅

@lfoppiano changed the title from "Store internal URL in response metadata (covering more than okhttp-protocol plugin)" to "Allow caller to retrieve the URL used for the request" on May 26, 2026

lfoppiano commented May 28, 2026

I think I found why https://xn--qv9h.s.country/robots.txt is missing in this branch.

A cache is hit when the robot rules are fetched at FetcherThread:436:

            BaseRobotRules rules = protocol.getRobotRules(fit.u, fit.datum,
                robotsTxtContent);
            if (robotsTxtContent != null) {
              outputRobotsTxt(robotsTxtContent);
              robotsTxtContent.clear();
            }

fit = {FetchItem@6480}
outlinkDepth = 0
queueID = "xn--qv9h.s.country"
url = {Text@6484} "https://xn--qv9h.s.country/p/human-protocol-aligning-hearts-bots"
u = {URL@6482} "https://xn--qv9h.s.country/p/human-protocol-aligning-hearts-bots"

This reaches HttpRobotRulesParser.getRobotRulesSet(...):

@Override
 public BaseRobotRules getRobotRulesSet(Protocol http, URL url,
     List<Content> robotsTxtContent) {

   if (LOG.isTraceEnabled() && isAllowListed(url)) {
     LOG.trace("Ignoring robots.txt (host is allowlisted) for URL: {}", url);
   }

   String cacheKey = getCacheKey(url);
   BaseRobotRules robotRules = CACHE.get(cacheKey);

   if (robotRules != null) {
     return robotRules; // cached rule
   } else if (LOG.isTraceEnabled()) {
     LOG.trace("Robots.txt cache miss {}", url);
   }

In master, the cache is not hit because the URL is resolved as %F0%9F%A7%A0.s.country/p/human-protocol-..., the percent-encoded form of the host. That host has already been visited, but under a different form, so the cache lookup misses (could this be because the robots.txt URLs do not pass through the URL normalizer?). In this branch the cache is hit, so the cached robot rules are returned.
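
To make the mismatch concrete, here is a minimal sketch, assuming the cache is keyed by a plain `protocol:host:port` string built from the host exactly as it was given. The class `RobotsCacheKeyDemo`, the `cacheKey` helper and the key layout are hypothetical and not copied from HttpRobotRulesParser; only the two host spellings come from the records above.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

public class RobotsCacheKeyDemo {

  // Hypothetical key layout (an assumption, not Nutch's actual getCacheKey):
  // lowercased "protocol:host:port", built from the host string as given.
  public static String cacheKey(String protocol, String host, int port) {
    return (protocol + ":" + host + ":" + port).toLowerCase(Locale.ROOT);
  }

  public static void main(String[] args) {
    // Stand-in for the robots rules cache; the value type is a placeholder.
    Map<String, String> cache = new HashMap<>();

    // The two host spellings that appear in the records above.
    String percentEncodedHost = "%F0%9F%A7%A0.s.country";
    String punycodeHost = "xn--qv9h.s.country";

    // Rules get cached under one spelling of the host ...
    cache.put(cacheKey("https", punycodeHost, 443), "placeholder rules");

    // ... so a lookup with that spelling finds the entry,
    // while a lookup with the other spelling does not.
    System.out.println("punycode lookup hit: "
        + cache.containsKey(cacheKey("https", punycodeHost, 443)));
    System.out.println("percent-encoded lookup hit: "
        + cache.containsKey(cacheKey("https", percentEncodedHost, 443)));
  }
}
```

If both spellings were normalized to a single form before the key is built, the two lookups would share one cache entry. Whether that is the right fix here is still open.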

I still don't know what triggers that last fetch of https://xn--qv9h.s.country/robots.txt; this needs further investigation.
